In this hands-on session, we will show how principal component analysis (PCA) is used in common bioinformatics workflows. We will use a toy single-cell RNA-seq (scRNA-seq) dataset to illustrate the key ideas.
The main objectives are the following:

1. Compute PCA in three different ways: eigendecomposition of the covariance matrix, singular value decomposition (SVD), and the prcomp function.
2. Use PCA to reduce the dimensionality of a single-cell RNA-seq dataset.
3. Cluster and annotate the cells in the reduced space.

Single-cell RNA-seq allows the gene expression profiling of thousands of cells, one at a time. It is often exemplified using the following analogy:
The scRNA-seq pipeline can be summarized in these 4 steps:
Image extracted from this paper.
scRNA-seq is a double-edged sword. On the one hand, scRNA-seq provides unprecedented discriminative power to chart all the cell types and states in a given tissue. This has catalyzed the creation of the Human Cell Atlas (HCA), which aims to classify all cells in the human body using scRNA-seq, and which has been described as “the periodic table of human cells”.
On the other hand, however, scRNA-seq has a high degree of technical variability, which can be introduced at multiple points of the aforementioned pipeline. Some examples include the following:
In this scenario, PCA is used for 3 main purposes:

1. Reduce the dimensionality of the dataset.
2. Mitigate technical noise by keeping only the most informative components.
3. Speed up downstream analyses such as clustering and visualization.
As an example, we will create an expression matrix with the following cells:
library(pheatmap)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.3.2     ✔ purrr   0.3.4
## ✔ tibble  3.0.3     ✔ dplyr   1.0.2
## ✔ tidyr   1.1.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.5.0
## Warning: package 'tibble' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Create toy dataset
toy <- data.frame(
CD3D = c(4, 5, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
CD3E = c(5, 3, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
LYZ = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
S100A8 = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
CD79A = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 4, 3, 5, 0, 0),
CD79B = c(0, 0, 0, 0, 0, 0, 0, 5, 5, 3, 4, 6, 0, 0),
BLNK = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 5, 5, 4, 0, 0),
TOP2A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 9, 0, 0),
MKI67 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 10, 0, 0),
MT_CO1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 23),
MT_ND5 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22, 25)
)
rownames(toy) <- paste("cell", as.character(1:nrow(toy)))
toy <- as.matrix(toy)
# Visualize
pheatmap(toy, cluster_cols = FALSE, cluster_rows = FALSE, angle_col = 45)
In the following image you can visualize the redundancy between CD79A and CD79B. Together they form the heterodimer CD79, which means that for every molecule of CD79A we need one molecule of CD79B. Under the hood they behave as a single variable, which should be captured by a single principal component. A good analogy would be measuring the height of a person in both feet and cm.
Images obtained from this link
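This redundancy can be quantified directly: since CD79A and CD79B track each other across cells, their correlation is close to 1. A minimal check, copying the two columns from the toy matrix defined above:

```r
# CD79A and CD79B expression across the 14 cells (values from the toy matrix)
cd79a <- c(0, 0, 0, 0, 0, 0, 0, 6, 4, 4, 3, 5, 0, 0)
cd79b <- c(0, 0, 0, 0, 0, 0, 0, 5, 5, 3, 4, 6, 0, 0)
# Pearson correlation is close to 1, so a single PC can capture both genes
cor(cd79a, cd79b)
```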
Three ways to perform PCA in R:

1. Eigendecomposition of the covariance matrix.
2. Singular value decomposition (SVD) of the centered data.
3. The prcomp function.

Regardless of the method, we always need to center the data:
toy_centered <- scale(toy, center = TRUE, scale = FALSE)
toy_centered[1:6,1:6]
## CD3D CD3E LYZ S100A8 CD79A CD79B
## cell 1 2.857143 3.857143 -1.285714 -1.285714 -1.571429 -1.642857
## cell 2 3.857143 1.857143 -1.285714 -1.285714 -1.571429 -1.642857
## cell 3 2.857143 2.857143 -1.285714 -1.285714 -1.571429 -1.642857
## cell 4 1.857143 2.857143 -1.285714 -1.285714 -1.571429 -1.642857
## cell 5 -1.142857 -1.142857 3.714286 3.714286 -1.571429 -1.642857
## cell 6 -1.142857 -1.142857 5.714286 5.714286 -1.571429 -1.642857
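As a sanity check, after centering every column mean should be (numerically) zero. A minimal illustration with a small stand-in matrix in place of toy:

```r
# Centering subtracts each column mean, so the centered columns average to zero
m <- matrix(c(4, 5, 4, 3, 0, 0, 7, 6), nrow = 4)
m_centered <- scale(m, center = TRUE, scale = FALSE)
colMeans(m_centered)  # both entries are 0 (up to floating point)
```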
Now we will compute the exact same PCA in these 3 ways. In each case we will track the same three pieces of information:

* the principal directions,
* the variance explained by each PC,
* the PC scores (the coordinates of each cell in the new basis).
We will be computing an eigenbasis for the covariance matrix \(\Omega = \frac{1}{n-1}A^tA\) where \(A\) is the centered data.
cov_matrix <- (1 / (nrow(toy_centered) - 1)) * (t(toy_centered) %*% toy_centered)
eigen_cov_matrix <- eigen(cov_matrix)
pdir_eigen <- eigen_cov_matrix$vectors
pdir_eigen[1:6,1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.03621728 -0.1281224 -0.2892754 0.37018269 -0.1942142 -0.4178419
## [2,] -0.03621728 -0.1281224 -0.2892754 0.37018269 -0.1942142 -0.4178419
## [3,] -0.04319892 -0.1949563 0.5855971 -0.07543384 -0.1254525 -0.2728953
## [4,] -0.04319892 -0.1949563 0.5855971 -0.07543384 -0.1254525 -0.2728953
## [5,] -0.06565032 0.2467930 -0.1395316 -0.42583898 -0.3020970 -0.3628173
## [6,] -0.06999161 0.2975760 -0.1023968 -0.35142006 -0.6180140 0.1555483
var_eigen <- eigen_cov_matrix$values
head(var_eigen)
## [1] 133.1235625 36.7772802 14.4225675 10.9965640 0.4843188 0.2349899
scores_eigen <- toy_centered %*% pdir_eigen
scores_eigen[1:6, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## cell 1 -3.951320 -3.948608 -3.661022 3.6545206 -0.25850086 -0.47150505
## cell 2 -3.915103 -3.820486 -3.371747 3.2843379 -0.06428666 -0.05366314
## cell 3 -3.915103 -3.820486 -3.371747 3.2843379 -0.06428666 -0.05366314
## cell 4 -3.878885 -3.692364 -3.082472 2.9141552 0.12992755 0.36417876
## cell 5 -4.057353 -4.745070 4.798428 -0.4314620 0.23490185 0.56011951
## cell 6 -4.230149 -5.524895 7.140816 -0.7331974 -0.26690820 -0.53146153
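A useful property to verify: the sample variance of each score column equals the corresponding eigenvalue. A quick check on a small stand-in matrix (the same holds for toy_centered):

```r
# Variance of the j-th PC score column == j-th eigenvalue of the covariance matrix
set.seed(1)
A <- scale(matrix(rnorm(40), nrow = 10), center = TRUE, scale = FALSE)
eig <- eigen(cov(A))
scores <- A %*% eig$vectors
all.equal(apply(scores, 2, var), eig$values)  # TRUE
```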
We can apply SVD to the centered data \(A\). Observe that if we want the singular values to reflect the same statistical information as above, we need to correct by a factor of \(1/\sqrt{n-1}\) where \(n\) is the number of rows of \(A\).
toy_to_svd <- toy_centered
toy_svd <- svd(toy_to_svd, nu = nrow(toy_to_svd))
pdir_svd <- toy_svd$v
pdir_svd[1:6,1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.03621728 -0.1281224 0.2892754 0.37018269 0.1942142 -0.4178419
## [2,] -0.03621728 -0.1281224 0.2892754 0.37018269 0.1942142 -0.4178419
## [3,] -0.04319892 -0.1949563 -0.5855971 -0.07543384 0.1254525 -0.2728953
## [4,] -0.04319892 -0.1949563 -0.5855971 -0.07543384 0.1254525 -0.2728953
## [5,] -0.06565032 0.2467930 0.1395316 -0.42583898 0.3020970 -0.3628173
## [6,] -0.06999161 0.2975760 0.1023968 -0.35142006 0.6180140 0.1555483
d_adjusted <- toy_svd$d * (1 / sqrt(nrow(toy) - 1))
var_svd <- d_adjusted ** 2
head(var_svd)
## [1] 133.1235625 36.7772802 14.4225675 10.9965640 0.4843188 0.2349899
sigma <- diag(toy_svd$d, nrow(toy_svd$u), ncol(toy_svd$v))
scores_svd <- toy_svd$u %*% sigma
scores_svd[1:6,1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -3.951320 -3.948608 3.661022 3.6545206 0.25850086 -0.47150505
## [2,] -3.915103 -3.820486 3.371747 3.2843379 0.06428666 -0.05366314
## [3,] -3.915103 -3.820486 3.371747 3.2843379 0.06428666 -0.05366314
## [4,] -3.878885 -3.692364 3.082472 2.9141552 -0.12992755 0.36417876
## [5,] -4.057353 -4.745070 -4.798428 -0.4314620 -0.23490185 0.56011951
## [6,] -4.230149 -5.524895 -7.140816 -0.7331974 0.26690820 -0.53146153
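Note that the scores above match the eigendecomposition results up to the sign of each column. Eigenvectors and singular vectors are only defined up to sign, so this is expected; a compact check on a stand-in matrix:

```r
set.seed(1)
A <- scale(matrix(rnorm(40), nrow = 10), center = TRUE, scale = FALSE)
eig_scores <- A %*% eigen(cov(A))$vectors
s <- svd(A)
svd_scores <- s$u %*% diag(s$d)
# Columns may flip sign between the two methods, so compare absolute values
all.equal(abs(eig_scores), abs(svd_scores))  # TRUE
```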
Stand-alone PCA function provided by R:
pca_out <- prcomp(toy, center = TRUE)
Use the following keywords to parse the output object:

* principal directions: “rotation”
* standard deviations of the PCs (singular values divided by \(\sqrt{n-1}\)): “sdev”
* PC scores: “x”
pdir_pca <- pca_out$rotation
pdir_pca[1:6,1:6]
## PC1 PC2 PC3 PC4 PC5 PC6
## CD3D -0.03621728 -0.1281224 0.2892754 0.37018269 0.1942142 -0.4178419
## CD3E -0.03621728 -0.1281224 0.2892754 0.37018269 0.1942142 -0.4178419
## LYZ -0.04319892 -0.1949563 -0.5855971 -0.07543384 0.1254525 -0.2728953
## S100A8 -0.04319892 -0.1949563 -0.5855971 -0.07543384 0.1254525 -0.2728953
## CD79A -0.06565032 0.2467930 0.1395316 -0.42583898 0.3020970 -0.3628173
## CD79B -0.06999161 0.2975760 0.1023968 -0.35142006 0.6180140 0.1555483
var_pca <- pca_out$sdev ** 2
head(var_pca)
## [1] 133.1235625 36.7772802 14.4225675 10.9965640 0.4843188 0.2349899
scores_pca <- pca_out$x
scores_pca[1:6,1:6]
## PC1 PC2 PC3 PC4 PC5 PC6
## cell 1 -3.951320 -3.948608 3.661022 3.6545206 0.25850086 -0.47150505
## cell 2 -3.915103 -3.820486 3.371747 3.2843379 0.06428666 -0.05366314
## cell 3 -3.915103 -3.820486 3.371747 3.2843379 0.06428666 -0.05366314
## cell 4 -3.878885 -3.692364 3.082472 2.9141552 -0.12992755 0.36417876
## cell 5 -4.057353 -4.745070 -4.798428 -0.4314620 -0.23490185 0.56011951
## cell 6 -4.230149 -5.524895 -7.140816 -0.7331974 0.26690820 -0.53146153
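As a final consistency check, prcomp reproduces the eigendecomposition results: the squared sdev values are the eigenvalues of the covariance matrix, and the scores agree up to column signs. Illustrated on a stand-in matrix:

```r
set.seed(1)
X <- matrix(rnorm(40), nrow = 10)
p <- prcomp(X, center = TRUE)
Xc <- scale(X, center = TRUE, scale = FALSE)
eig <- eigen(cov(Xc))
all.equal(p$sdev^2, eig$values)                       # TRUE
all.equal(abs(unname(p$x)), abs(Xc %*% eig$vectors))  # TRUE
```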
# PCA centered and scaled
pca_out_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)
Importantly, we obtain the following:

* principal directions: pca_out$rotation
* PC scores: pca_out$x
* variances: pca_out$sdev ** 2

In addition, we can get the proportion of variance explained (PVE) and the cumulative proportion with the summary function (more info in this book):
summary(pca_out_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1414 1.6160 1.5448 1.1065 0.28172 0.25075 0.19522
## Proportion of Variance 0.4169 0.2374 0.2170 0.1113 0.00722 0.00572 0.00346
## Cumulative Proportion 0.4169 0.6543 0.8712 0.9826 0.98977 0.99548 0.99895
## PC8 PC9 PC10 PC11
## Standard deviation 0.09752 0.04135 0.01862 1.383e-16
## Proportion of Variance 0.00086 0.00016 0.00003 0.000e+00
## Cumulative Proportion 0.99981 0.99997 1.00000 1.000e+00
To reduce the dimensionality, we will choose the number of PCs that explain most of the variance in the dataset. To this end, we use a so-called elbow plot:
# Calculate percentage of variance explained (PVE):
pve <- pca_out_scaled$sdev ** 2 / sum(pca_out_scaled$sdev ** 2) * 100
pve <- round(pve, 2)
pve_df <- data.frame(principal_component = 1:length(pve), pve = pve)
pve_gg <- pve_df %>%
ggplot(aes(principal_component, pve)) +
geom_point() +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
scale_x_continuous(breaks = 1:length(pve)) +
labs(x = "Principal Component", y = "Percentage of Variance Explained (%)") +
theme_bw()
pve_gg
Some people prefer to plot the cumulative percentage of explained variance:
summ_pca <- summary(pca_out_scaled)
cum_pct <- summ_pca$importance["Cumulative Proportion", ] * 100
cum_pct_df <- data.frame(principal_component = 1:length(pve), cum_pct = cum_pct)
cum_pct_gg <- cum_pct_df %>%
ggplot(aes(principal_component, cum_pct)) +
geom_point() +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
scale_x_continuous(breaks = 1:length(cum_pct)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Principal Component", y = "Cumulative Percentage of Variance Explained (%)") +
theme_bw()
cum_pct_gg
Thus, we conclude that the first 4 PCs are significant. Let us express each cell as a vector of its first 4 PC scores:
toy_reduced <- pca_out_scaled$x[, c("PC1", "PC2", "PC3", "PC4")]
gene_loads_reduced <- pca_out_scaled$rotation[, c("PC1", "PC2", "PC3", "PC4")]
pheatmap(toy_reduced, cluster_rows = FALSE, cluster_cols = FALSE, angle_col = 45)
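Keeping 4 of 11 PCs discards only the variance carried by PC5 to PC11 (about 2% here, per the cumulative proportion above). The general identity, sketched on a stand-in matrix: reconstructing the scaled data from the top k PCs leaves a residual fraction equal to 1 minus the cumulative PVE at PC k.

```r
# Low-rank reconstruction from the top k PCs (stand-in matrix)
set.seed(1)
X <- matrix(rnorm(60), nrow = 12)
p <- prcomp(X, center = TRUE, scale. = TRUE)
k <- 2
approx_k <- p$x[, 1:k] %*% t(p$rotation[, 1:k])
resid_frac <- sum((scale(X) - approx_k)^2) / sum(scale(X)^2)
# Residual fraction == 1 - cumulative proportion of variance at PC k
all.equal(resid_frac, 1 - cumsum(p$sdev^2)[k] / sum(p$sdev^2))  # TRUE
```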
Now we have reduced the dimensionality of the dataset. To interpret each PC, let us inspect the gene loadings:
significant_pcs <- c("PC1", "PC2", "PC3", "PC4")
loadings_gg <- map(significant_pcs, function(x) {
loadings <- gene_loads_reduced[, x]
df <- data.frame(gene = names(loadings), score = loadings)
p <- df %>%
ggplot(aes(fct_reorder(gene, score), score)) +
geom_point() +
labs(x = "", y = x) +
theme_bw() +
coord_flip()
return(p)
})
loadings_gg
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
Interpretation of the PCs:
Clustering means grouping observations so that intra-group distances are minimized and inter-group distances are maximized.
To that end we need a notion of distance, which measures how similar or different two vectors are. Here we will use the Euclidean distance, but be aware that there is a plethora of alternatives that can yield different clustering results.
Let us start by computing all pairwise distances:
dist_mat <- dist(toy_reduced, method = "euclidean")
pheatmap(dist_mat, cluster_rows = FALSE, cluster_cols = FALSE)
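The Euclidean distance between two cells is just the square root of the sum of squared coordinate differences; dist computes this for every pair of rows. A minimal check:

```r
# Euclidean distance: sqrt of summed squared differences
x <- c(1, 2, 3)
y <- c(4, 6, 3)
sqrt(sum((x - y)^2))           # 5
as.numeric(dist(rbind(x, y)))  # 5
```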
hclust_average <- hclust(dist_mat, method = "average")
plot(
hclust_average,
labels = rownames(toy),
main = "Average Linkage",
xlab = "",
sub = "",
ylab = ""
)
plot(
hclust_average,
labels = rownames(toy),
main = "Average Linkage",
xlab = "",
sub = "",
ylab = ""
)
abline(h = 2.5, col = "red")
hc_clusters <- cutree(hclust_average, h = 2.5)
hc_clusters
## cell 1 cell 2 cell 3 cell 4 cell 5 cell 6 cell 7 cell 8 cell 9 cell 10
## 1 1 1 1 2 2 2 3 3 3
## cell 11 cell 12 cell 13 cell 14
## 4 4 5 5
table(hc_clusters)
## hc_clusters
## 1 2 3 4 5
## 4 3 3 2 2
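Note that cutree can also be asked for a fixed number of clusters with the k argument instead of a cut height h; cutting at a height that yields 5 clusters is equivalent to requesting k = 5. A quick illustration on a stand-in tree:

```r
# Cutting by height vs. by number of clusters gives the same partition
set.seed(1)
hc <- hclust(dist(matrix(rnorm(20), nrow = 10)), method = "average")
clusters_h <- cutree(hc, h = 1.5)
clusters_k <- cutree(hc, k = length(unique(clusters_h)))
identical(clusters_h, clusters_k)  # TRUE
```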
annotation_rows <- data.frame(cluster = as.character(hc_clusters))
rownames(annotation_rows) <- names(hc_clusters)
pheatmap(
toy_centered,
cluster_rows = FALSE,
cluster_cols = FALSE,
angle_col = 315,
annotation_row = annotation_rows
)
| Cluster | Cell type |
|---|---|
| 1 | T-cells |
| 2 | Monocytes |
| 3 | Naive B-cells |
| 4 | Plasma Cells |
| 5 | poor-quality cells |
annotated_clusters <- c("T-cells", "Monocytes", "Naive B-cells", "Plasma Cells", "poor-quality cells")
names(annotated_clusters) <- as.character(1:5)
annotation_rows$annotation <- annotated_clusters[annotation_rows$cluster]
pheatmap(
toy_centered,
cluster_rows = FALSE,
cluster_cols = FALSE,
angle_col = 315,
annotation_row = annotation_rows
)
sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
## [5] readr_1.3.1 tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2
## [9] tidyverse_1.3.0 pheatmap_1.0.12 BiocStyle_2.16.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.18 haven_2.3.1
## [4] colorspace_1.4-1 vctrs_0.3.4 generics_0.0.2
## [7] htmltools_0.5.0 yaml_2.2.1 blob_1.2.1
## [10] rlang_0.4.7 pillar_1.4.6 withr_2.3.0
## [13] glue_1.4.2 DBI_1.1.0 RColorBrewer_1.1-2
## [16] dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1
## [19] lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
## [22] cellranger_1.1.0 rvest_0.3.6 evaluate_0.14
## [25] labeling_0.3 knitr_1.30 fansi_0.4.1
## [28] broom_0.7.1 Rcpp_1.0.5 scales_1.1.1
## [31] backports_1.1.10 BiocManager_1.30.10 magick_2.4.0
## [34] jsonlite_1.7.1 farver_2.0.3 fs_1.5.0
## [37] hms_0.5.3 digest_0.6.25 stringi_1.5.3
## [40] bookdown_0.20 grid_4.0.1 cli_2.0.2
## [43] tools_4.0.1 magrittr_1.5 crayon_1.3.4
## [46] pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2
## [49] reprex_0.3.0 lubridate_1.7.9 rstudioapi_0.11
## [52] assertthat_0.2.1 rmarkdown_2.4 httr_1.4.2
## [55] R6_2.4.1 compiler_4.0.1